DOCUMENT RESUME

ED 066 463                                              TM 001 772

AUTHOR          Hoepfner, Ralph; Strickland, Guy P.
TITLE           Investigating Test Bias.
INSTITUTION     California Univ., Los Angeles. Center for the Study
                of Evaluation.
SPONS AGENCY    Office of Education (DHEW), Washington, D.C.
REPORT NO       CSE-74
PUB DATE        Feb 72
NOTE            35p.

EDRS PRICE      MF-$0.65 HC-$3.29
DESCRIPTORS     Achievement Tests; Caucasians; *Elementary Grades;
                Ethnic Groups; Evaluation Criteria; Grade 3; Mexican
                Americans; Negroes; *Predictor Variables; Reading
                Tests; *Socioeconomic Influences; Standardized Tests;
                *Student Evaluation; *Test Bias; Test Interpretation;
                Test Reviews
IDENTIFIERS     California; Orientals

ABSTRACT
     This study investigates the question of test bias to develop an
index of the appropriateness of a test to a particular socioeconomic or
racial-ethnic group. Bias is defined as an item by race interaction in
an analysis-of-variance design. The sample of 172 third graders at two
integrated schools in a large California school district included 26
white students, 20 Blacks, 64 Mexican-Americans, and 37 Orientals. In
order to obtain the initial information about item by race interaction,
the Stanford Achievement Test, Paragraph Meaning subtest was used. Item
regression data for six racial pairings were inspected: Whites/Blacks;
Whites/Mexican-Americans; Whites/Orientals; Blacks/Mexican-Americans;
Blacks/Orientals; and Mexican-Americans/Orientals. Various methods of
establishing the existence and nature of test bias are discussed, with
the conclusion that test bias cannot be conclusively demonstrated in a
wholly satisfactory manner. One method was nonetheless selected and
applied to test items administered to two field-test schools for the
purpose of investigating bias. The results of that small-scale study
are discussed, but do not offer compelling reasons for the observed
racial-ethnic differences. (Author/LS)


U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE
OFFICE OF EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR
OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE
OF EDUCATION POSITION OR POLICY.









INVESTIGATING TEST BIAS

By

Ralph Hoepfner and Guy P. Strickland

CSE Report No. 74
February 1972

School Evaluation Program
Center for the Study of Evaluation
UCLA Graduate School of Education
Los Angeles, California










The Center for the Study of Evaluation is engaged in numerous studies
involving the dissemination and application of evaluation methodology in
the nation's classrooms. The Center has designed workshops and kits to
educate school personnel in evaluation techniques and, at the same time,
has tried to present those evaluation techniques to school personnel in a
form both appropriate and useful to them. The most commonly used technique
in educational evaluation, at this time, is the measurement of student
performance through use of standardized test instruments. In the CSE
Elementary School Test Evaluations (Hoepfner, R., Strickland, G.,
Stangel, G., Jansen, P., and Patalino, M., 1970), the Center has provided
information on over 1,600 published tests and subtests for use in
elementary schools. On the basis of this information, school principals
are enabled to choose the tests most suitable for the needs in their
schools.

The tests were evaluated and rated on 24 criteria; six of these criteria
were concerned with the tests' appropriateness for the students being
tested. This "appropriateness" was interpreted, for the sake of simplicity
and generalizability, as appropriateness to an average class in each grade
(tests for grades 1, 3, 5, and 6 were evaluated). At the same time, the
Center acknowledged that some tests may be more or less appropriate for
certain socioeconomic or ethnic groups.

Many school districts, particularly in lower socioeconomic and highly
ethnic neighborhoods, are under pressure from teachers and/or parents to
abandon testing completely because the tests are felt to be inappropriate
for their communities. If one considers that testing has two functions,
pupil placement and the measurement of program effectiveness, the abolition
of testing can be seen as being potentially beneficial in ending the abuse
in pupil placement due to the use of inappropriate and biased tests; but
such abolition would inhibit measurement of program effectiveness, thereby
inhibiting program improvement and program change. This is particularly
true when the test is appropriate (referenced) to the instructional program.




The Center felt it necessary to investigate the question of test bias
with the intention of developing an index of the appropriateness of a test
to a particular socioeconomic or racial-ethnic group. This would be a
valuable appendix to the CSE Elementary School Test Evaluations, expanding
its usefulness to a broader range of school and pupil types. Our intent was
to isolate either the tests or the test items that exhibited
characteristics indicating bias, induce the aspects of the tests or items
that are common to the "biased" measures, and from those aspects develop a
quantitative index of what might be called "predicted bias" which could be
generalized and applied to a wide range of test instruments.

Approaches to Measuring Bias 

In attempting to develop its own bias indexes, the Center considered
several approaches that have been described and employed in the literature.



External Explanations of Test Bias

Several of the procedures for explaining or establishing the existence
of test bias depend upon criteria external to the test instrument itself.
Characteristics of norms and validities are examples of this type.

Bias by Norm Sample. Certain tests may have different norm samples
that cause differences among racial and socio-economic groups. If it is
assumed that, relative to a particular objective of achievement or
aptitude, there are not systematic and reliable differences in underlying
standings among the various subgroups in the population, then for a
veridical test of that aptitude or achievement it shouldn't matter whether
the norm sample systematically excludes members of any subgroup; the
obtained norms would be more or less appropriate for all subgroups in the
population.

But Millman and Lindlof (1964) have shown that different tests with
different norm samples yield markedly different percentile norms by the
equi-percentile method. The specific differences had been previously
noted by several investigators and test users. More to the point, however,
Eagle and Harris (1969) showed significant differences in white-non-white
comparisons for two different tests. One test could be said to favor whites
much more than the other. These findings may have been due to the norms
used in assigning grade equivalents, to differences in test content or
format, or to basic racial-socioeconomic differences.

To the extent that the underlying ability or achievement is not equal
among all the subgroups or the test is a poor measure of the underlying
status, the norms are inappropriate for the excluded subgroup. Since the
former case is not wholly consistent with notions of bias, we can look at
some of the ways that tests can be poor measures of the underlying status.

In the construction of test instruments it is useful to distinguish
between the norm sample utilized for a test and the pilot sample used in
developing the test. If the pilot sample excludes subgroups (which it
usually does, in an effort to increase the economies of test development),
so that the statistical characteristics of the items reflect only certain
subgroups, then the items selected for the test will be items appropriate
(in terms of difficulty, external discriminability, and item content) for
the subgroup utilized. As a result, the content validity of the test may
be different for the different subgroups. But if we continue to assume
that the true performance of all groups over the objectives of the test is
the same, then the process of item selection based on a pilot sample from
one subgroup only is as likely to yield higher as lower content validity
for the excluded subgroups. The presence of bias is always possible
when a test designed for one group is given to another; that is an
experimental truism that transcends any issue of racial bias. But the bias
is not a systematic one; it does not consistently or by design favor one
group over another.

On the other hand, if it is assumed that some ethnic groups have lower
ability on a particular objective than other groups, then poor norm sampling
can create or accentuate the bias. Many studies have found that score means
for racial minority groups are lower and standard deviations smaller than
those for the majority white group. Were this phenomenon constant for
certain types of tests, then it would follow that any norming of the test
must have appropriate subgroup representation if the bias is not to be
enhanced or increased through use of the norming procedure. A disadvantaged
minority student who takes a test normed on white children will be given a
score that is too far below the mean because (1) the mean is artificially
high, and (2) the distance in standard-score units is artificially inflated
because the unit of measurement (standard deviation) is too small. If the
minority group is superior to the white norm group on the objective
performance, the foregoing function still holds; only the bias will favor
the minority group. Such phenomena would clearly not be cases of test bias,
however; they would be cases of racial differences on the objectives under
consideration, merely being reflected by the test scores.
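
A small numerical sketch may make the norming argument concrete. The scores
below are invented for illustration and do not come from this report, and
`z_score` is a hypothetical helper. The sketch compares the standard score a
pupil would receive under norms built from the majority group alone with the
score under norms built from a sample that represents both groups.

```python
from statistics import mean, pstdev

def z_score(raw, norm_sample):
    """Standard score of `raw` relative to a norm sample."""
    return (raw - mean(norm_sample)) / pstdev(norm_sample)

# Hypothetical raw scores, invented for illustration only.
majority = [50, 55, 60, 65, 70]
minority = [40, 44, 48, 52, 56]
representative = majority + minority  # norm sample that includes both groups

raw = 48  # a pupil scoring at the minority-group mean
z_majority_norm = z_score(raw, majority)
z_representative_norm = z_score(raw, representative)

# The majority-only norm places the same raw score further below the mean.
print(round(z_majority_norm, 2), round(z_representative_norm, 2))
```

With these made-up numbers, the majority-only norm has both a higher mean and
a smaller standard deviation than the combined sample, so the same raw score
falls further below the mean in standard-score units.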

Item selection based on data from a poorly normed pilot sample, in a
situation where ethnic subgroups are assumed to be of unequal ability,
does not by itself ensure that there will be bias. There is no reason to
believe that the process will cause selection of items biased in favor of
the pilot group in such a way that the pre-existing differences between
ethnic groups are systematically increased or decreased. Again, the content
validity may be different for different ethnic groups.

Bias by Predictive Validities. Certain tests may have different
predictive validities for different racial groups by underpredicting or by
overpredicting minority-group performance. Moderated prediction (Einhorn
and Bass, 1971) may be called for for the different groups, yielding
different regression functions. This differential validity proposition,
however, is based upon the inequality (read differences) in the subgroups.
In addition, certain tests may suffer in predictive validity due to
inappropriateness of comprehension level or test content.

Temp (1971) found that, employing the Scholastic Aptitude Test (SAT)
as a predictor of GPA, black and white regression equations were different
and that black GPA, when estimated from the white regression equation, was
overpredicted. This two-pronged approach addresses the issue of bias via
prediction in two ways. First, Temp essentially considered his black sample
and his white sample as independent populations (therefore, not necessarily
equal, the same, or even similar), each giving rise to regression parameters
that could be compared for differences. Second, the samples were considered
to be subsamples of the same population; and, if race were the only
selection variable (and also not influencing regression phenomena), the null
hypothesis would be that utilizing one subgroup's regression equation on
another's scores and achievements would not result in systematic under- or
over-prediction. Of course, it did result in overprediction, but race alone
may not have been the cause, as observed race is confounded by many social
and environmental variables.

Differences in predictive validities were noted for the Negro and
white samples in the study by Goolsby and Frary (1970). Such large and
systematic differences were not noted when the study sample was regrouped
in terms of sex or school readiness. The validity differences, computed
separately for black and white groups, were not, however, all in favor of
higher predictability of the white sample; in fact, 46 of 70 of the
validity coefficients were greater for the black sample. In a like manner,
Mitchell (1967) found about an even split (26 to 19) in greater predictive
validities for black and white samples, respectively.

Linn and Werts (1971) discuss some of the technical problems that may
account for differences between subgroup regression equations. These
problems are predictor unreliability and exclusion of certain predictors.
Linn and Werts are the first to explicitly point out the necessary equality
of criterion performance in order to compare regression data to uncover the
effects of test bias. In a similar manner, Thorndike (1971) suggests
several alternatives for handling test bias, depending upon how it is
defined, but also suggests that criterion bias may play a larger role in the
observed sub-group differences. Thorndike's conclusion closely parallels the
one we shall reach:

     "If the criterion measure is itself biased in an unknown direction
     and degree, no rational procedure can be set up for "fair" use of
     the test. To determine that test scores in the two groups predict
     a given criterion rating is fruitless if the criterion rating does
     not really mean the same thing in the major and minor groups. And
     by the same token, setting up group quotas based on proportions in
     previous major and minor groups that have achieved a specified
     criterion rating is fruitless if the criterion rating signifies
     different things in the two groups." (p. 70)

If, indeed, we reject the current definitions of test bias, and are
consequently forced to accept criterion bias, then Thorndike's conclusion
is of far greater general consequence than it at first appears.

The investigator of test bias can determine the nature of the bias of a
test, as a predictor, by examining those systematic over- or
under-predictions of some criterion performance by the test. In order to
uncover any bias in the predictor, however, we must assume that the
distribution of criterion performance of the sample (the minority group that
may be being detrimentally assessed or classified by the supposedly biased
predictor) is not itself a biased sampling on the criterion performance of
the population (the larger group, upon which the validity of the test has
been determined). What is meant here is essentially this: if we wish to
determine whether or not the ACME Reading Test is biased for the educational
assessment and placement of young black learners, or whether the ACE Typing
Test is biased for selection or promotion of Mexican-American clerical job
applicants, it is necessary to accept the unbiased nature of the "true"
learning aptitude of the learners or the "true" typing skill of the job
applicants. It is crucial to assume that their "true" performance (not
necessarily faulty measures of it, for that compounds the problem) is in no
way systematically different from the norm of the population, or else one
has to "adjust" for those systematic differences (if they are known or can
be confidently estimated). Whether or not the assumption of equal criterion
performance is true is not known, but it is a reasonably safe null
hypothesis from both a statistical and an egalitarian point of view.
We can understand these complex phenomena more clearly if they are
illustrated through scatter plots that are commonly used to represent
validation data. Figure 1 illustrates such a scatter plot for a hypothetical
population, where each member's predictor score is cross-tabulated with his
achievement-performance score. For the sake of simplicity, let's say that
the correlation (validity coefficient) is +.60.



Figure 1

Scatter Plot of Predictive Validity Data for a Population
(vertical axis: Criterion Performance Measure; horizontal axis:
Predictor Test Score)



The regression line, R, has been drawn in Figure 1 to indicate the
best estimate (in a least-squares sense) of the criterion score if only the
predictor score is known (one merely locates the predictor score on the
horizontal axis, projects that point straight up until it intersects R, and
then projects the intersection left to the criterion score axis). If one
were to follow this procedure with any random or stratified-random sample
from the population, the predicted scores would, of course, fall on the
regression line and the actual criterion scores would be randomly
distributed above and below the line. While it is, moreover, safe to say
mathematically that no matter what kind of sample we select the predicted
scores will fall on the regression line, it is not so safe to say that the
actual criterion scores would also be randomly distributed above and below
that line.
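
The graphical procedure just described amounts to evaluating a fitted
straight line at a given predictor score. The following is a minimal sketch
under stated assumptions: the scores are invented for illustration, and
`fit_line` is a hypothetical helper rather than anything used in the report.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical predictor and criterion scores, invented for illustration.
predictor = [10, 20, 30, 40, 50]
criterion = [22, 18, 35, 30, 45]

slope, intercept = fit_line(predictor, criterion)

# "Project up to R, then left to the criterion axis" for each predictor score.
predicted = [intercept + slope * x for x in predictor]

# Actual minus predicted: positive values lie above the line, negative below.
residuals = [y - p for y, p in zip(criterion, predicted)]

print(slope, intercept)
print(residuals)
```

For the sample on which the line was fitted, the residuals balance out above
and below the line. For a different sample, as the text notes, nothing
guarantees that balance.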

But the assumption of the null hypothesis, that criterion performance
is essentially equal among all groups, leads to the conclusion that
non-random deviations of the predicted scores from the regression line will
not generally occur, except through faulty or unlikely sampling. Figure 2
illustrates the impossibility of this phenomenon by posing a case where
there appears to be over-prediction of success by the regression equation
developed from Figure 1.



Figure 2

Theoretical Scatter Plot of Predictive Validity Data for a
Minority Group



The reason that Figure 2 (or a figure with a similar under-prediction
bias) cannot be expected to occur is that either demands, contrary to the
null hypothesis, restriction on the range of criterion scores; after all,
in the first case (over-prediction) the sample exhibits few high criterion
scores. (Cases of restriction of range of the predictor scores are
different and are treated in a later section.) It becomes obvious that for
cases of either systematic over- or under-prediction, we have cases of
biased criterion performance, not of biased test! It can be seen that
Cleary's (1968) definition of test bias:

     "A test is biased for members of a subgroup of the population if,
     in the prediction of a criterion for which the test was designed,
     consistent nonzero errors of prediction are made for members of
     the subgroup. In other words, the test is biased if the criterion
     score predicted from the common regression line is consistently
     too high or too low for members of the subgroup. With this
     definition of bias, there may be a connotation of "unfair,"
     particularly if the use of the test produces a prediction that is
     too low. If the test is used in selection, members of a subgroup
     may be rejected when they were capable of adequate performance."

is really a definition of performance superiority or deficit or of criterion
bias, not of test bias.
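
Cleary's definition can be read as a statement about the average error of
prediction for a subgroup under the common regression line. The sketch below
is illustrative only: the line, the scores, and the helper name
`mean_prediction_error` are assumptions made for the example, not values or
procedures taken from Cleary or from this report.

```python
from statistics import mean

def mean_prediction_error(slope, intercept, predictor, criterion):
    """Mean of (actual - predicted) criterion scores under a given line."""
    errors = [y - (intercept + slope * x) for x, y in zip(predictor, criterion)]
    return mean(errors)

# Hypothetical common regression line and subgroup data (illustration only).
common_slope, common_intercept = 0.5, 20.0
subgroup_predictor = [30, 40, 50, 60]
subgroup_criterion = [31, 37, 41, 46]

error = mean_prediction_error(common_slope, common_intercept,
                              subgroup_predictor, subgroup_criterion)

# A negative mean error means the common line predicts too high on average.
print(error)
```

A consistently nonzero mean error is what the definition labels bias. The
computation itself cannot say whether the test or the criterion is the source
of the discrepancy, which is the point argued in the surrounding text.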

Two alternative phenomena could occur, however, that might indicate a
biased test: first, differential prediction of the criterion, and second,
truncation of the predictor score distribution. Since our null hypothesis
demands scores throughout the range of criterion performances, there are
only three types of misprediction that could occur with a predictor having
a full range of scores. These types are called "dulled" prediction,
"misprediction", and "reversed" prediction, and are graphically illustrated
in Figures 3, 4, and 5, respectively.



Figure 3

Scatter Plot of "Dulled" Prediction Validity Data for a
Minority Group
(vertical axis: Criterion Performance Measure; horizontal axis:
Predictor Test Score)

Figure 4

Scatter Plot of "Misprediction" Validity Data for a
Minority Group
(vertical axis: Criterion Performance Measure; horizontal axis:
Predictor Test Score)

Effects of such truncations on prediction are potentially widely varied.
But one doesn't have to look at validity data to determine such bias; a
simple test of the differences of the predictor scores for the minority
subgroup will reveal this bias more sensitively and more accurately.
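
The text does not name the "simple test" it has in mind. One possible choice,
assumed here purely for illustration, is Welch's t statistic comparing the
subgroup's predictor scores with those of the rest of the population. The
scores and the helper name `welch_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(sample_a) / len(sample_a)  # squared standard error, group a
    vb = variance(sample_b) / len(sample_b)  # squared standard error, group b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical predictor scores, invented for illustration only.
subgroup_scores = [40, 43, 46, 49, 52]
population_scores = [52, 55, 58, 61, 64, 67, 70]

t, df = welch_t(subgroup_scores, population_scores)
print(round(t, 3), round(df, 3))
```

A large negative t here would indicate that the subgroup's predictor scores
sit well below the population's. Judging significance would require comparing
t against a t distribution with the computed degrees of freedom, which this
sketch leaves out.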

Going back to Cleary's (1968) definition of bias, above, we can look
at regression lines for subgroups separately and compare them, both as to
how they predict the unbiased criterion performance and as to how each
subgroup's predicted performance systematically under- or over-estimates its
and other subgroups' actual performances.

In the first case, where the subgroup regression is compared to the
population regression, the phenomena described above obtain, the only
difference being that regressions are computed and compared instead of
noting scatter-plot differences. Likewise, in the second case, where
subgroups' regressions are compared and systematically cross-validated on
other subgroups, the same phenomena occur.

What emerges from this exhaustive review of the possibilities of bias
in prediction is this: if criterion performance is unbiased and equal across
subgroups, predictor studies can reveal only trivial examples of test bias.
The key to the meaning of this conclusion is, of course, the assumed
unbiased and equal criterion performance. To the extent that the "true"
criterion performance is not equal among subgroups, or measurements of the
performance are not equal (biased), then all the classical approaches to
predictor bias take on meaning, but usually the wrong meaning; the criterion
is biased, not the predictor.

In terms of describing what happens when there is [inequality in criterion]
performance or its measures, there is no value in making [a distinction
between] the population and one (majority) subgroup (the distinction is
certainly crucial, but not to the development of the prediction events to be
described). For this reason, "population" will be used to represent either
the real population or the majority subgroup, and "subgroup" will apply only
to one of the minority subgroups. Given these definitions, two criterion
performances are possible (equality being discussed above): either the
population performance is better than that for the subgroup, or it is worse.

If the population performance (or its measure) is better than that for
the subgroup (caused by hereditary or environmental selection or effects),
scores on a valid predictor for the population can appear to be biased in
two ways. When the predictor is also valid for the subgroup, Figure 6 shows
what will occur. When the prediction equation developed from the population
is used for members of the subgroup, the procedure will overpredict the
success of the subgroup, causing the selection of subgroup people who would
not be selected from the population and who have a high probability of not
meeting performance criterion standards. When the prediction equation
developed from the subgroup is applied to members of the population, the
success of population members will be underpredicted, causing the rejection
of the population people who would be selected from the population and who
have a high probability of meeting performance criterion standards.
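
The two directions of misprediction described above can be sketched by
fitting a separate line to each group and applying each line to the other
group. Everything in the example is an assumption made for illustration: the
scores are invented so that the subgroup's criterion scores run a constant
amount below the population's at every predictor level, and `fit_line` and
`mean_error` are hypothetical helpers.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def mean_error(line, x, y):
    """Mean of (actual - predicted); negative means overprediction."""
    slope, intercept = line
    return mean([b - (intercept + slope * a) for a, b in zip(x, y)])

# Hypothetical scores: same predictor range, subgroup criterion 6 points lower.
pop_x, pop_y = [20, 30, 40, 50, 60], [21, 24, 31, 34, 40]
sub_x, sub_y = [20, 30, 40, 50, 60], [15, 18, 25, 28, 34]

pop_line = fit_line(pop_x, pop_y)
sub_line = fit_line(sub_x, sub_y)

# Population equation applied to the subgroup: overprediction.
sub_under_pop_line = mean_error(pop_line, sub_x, sub_y)
# Subgroup equation applied to the population: underprediction.
pop_under_sub_line = mean_error(sub_line, pop_x, pop_y)

print(sub_under_pop_line, pop_under_sub_line)
```

In this constructed case the two lines are parallel, so the size of the
over- and under-prediction equals the criterion gap built into the data.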

When the predictor is not valid for the subgroup and the population
performance is better than that for the subgroup (illustrated in Figure 7),
we overpredict population performance by utilizing the subgroup regression,
the error of prediction being systematically larger for high scorers than
for low scorers.

In the event that the subgroup performance (or its measure) is better
than that for the population (an infrequently observed phenomenon) the
errors of prediction will be in the opposite direction of those discussed
above.

Figure 7

Scatter Plot of Predictive Validity Data for a Test Valid
for the Population and Invalid for the Subgroup, with the
Subgroup being Lower on the Criterion Performance.

What emerges from this review of the possibilities of bias in prediction
is this: If the criterion performance (or its measures) is biased for
one group, prediction studies will invariably reveal the apparent and
opposite bias in the predictors for that group.

The final set of cases in which tests, as predictors, can be
investigated for bias occurs when both the criterion performance and the
predictor scores are different between groups. Four such sets of
inequalities can occur, in degrees, but consideration of them will be
limited to extreme examples for purposes of clarity.

When criterion performance (or its measure) is higher for the population
than for the subgroup and predictor scores are also higher for the
population, the situation is illustrated in Figure 8. In part this situation
will not yield effects different from those associated with Figure 5, with
the exception that still fewer subgroup individuals will be selected
(although it is also true that fewer of the selected subgroup individuals
will fail). The opposite, of course, is true in the unlikely situation that
subgroup standings on both criterion performance and predictor measure are
higher than those for the population.

Figure 8

Scatter Plot of Predictive Validity Data for the Population and the
Subgroup, with the Population Superior to the Subgroup on both
the Predictor and the Criterion Measures
In the case where the subgroup's predictor score is higher than that for
the population, but its criterion measure is lower (as illustrated in
Figure 9), there would be a great overprediction of the success of the
subgroup members and a large number of failures in performance: a case of
charitable disservice.

In the opposite case, where the subgroup's predictor scores are lower
but its criterion measures are higher than the population's, we have the
ultimate in uncharitable disservice: too many subgroup individuals who
would be successful will not be selected.

Figure 9

Scatter Plot of Predictive Validity Data for the Population and the
Subgroup, with the Population Superior to the Subgroup on the
Criterion Measure, but Inferior on the Predictor Measure
(vertical axis: Criterion Performance Measure; horizontal axis:
Predictor Test Score; legend: + = population score; subgroup score
(symbol illegible); R = population regression; R' = subgroup regression)

From these reviews it can be seen that test bias can only be demonstrated through prediction when the criterion itself is unbiased; bias of the predictor alone will not be uncovered by predictive-validity studies for different subgroups of the population.

Internal Explanations of Test Bias

A second approach to the identification of bias is through the analysis of test-internal characteristics. The tests themselves are analyzed against racial-ethnic groups to determine inherent test characteristics that may cause or explain the bias effect.


Differential Factor Structures. Certain tests may have different factor structures for different ethnic groups which could account for differences in observed scores. Goolsby and Frary (1970) found just such factor structure differences, but their factors were largely determined by achievement measures and their predictors, predictive validities known to be different for the two groups of white and black school children.

On the other hand, numerous studies indicate that there is no difference in the factor structure of intellectual abilities among the racial groups. These studies find relatively invariant factors among groups, usually with one or two culturally-bound factors of a verbal sort (indicating language fluency or common experiences) also found. Vandenberg (1959), in his systematic study of racial differences and similarities, found that Chinese students in the U.S.A. exhibited the same factor structures on Thurstone's PMA tests as American students. In a later study (1967), Vandenberg found high agreement of factor structure between South American and Chinese students using the same test battery.

Guthrie (1963) found that Philippine women college students exhibited factors very similar to those found in western culture. Johnson (1969) found educational abilities and aptitudes factorially invariant with scores from subjects in Rhodesia and Zambia; and El Abd (1970) found similar factors of intelligence in samples of East African students and American samples. El Abd concluded that there are no basic differences in the structure of intelligence across the races.

Addressing the issue of socio-economic level, which is a potential cause for any observed differences among the races, McGaw and Jöreskog (1970) utilized a large sample from the Project TALENT study, and divided the high school students into four groups on the basis of high and low intelligence and high and low socio-economic status. Factor analysis of 21 aptitude-test scores revealed similar factor scores across all four groups, but factor interrelatedness was found to be higher in the high intelligence groups.


Differential Test Scores. Test bias may be defined as the interaction of race and test in an analysis of variance or by other tests for the significance of differences in means or standard deviations among groups. These statistical techniques use the scores of individuals from a variety of races on a variety of tests to determine the amount of variation in scores that is accounted for by differences in the tests alone, the amount accounted for by differences in the races alone, and the amount attributable to some combination or interaction of both race and test differences. The approaches do not address questions of equality of tests or equality of races; they look instead to significant differences among test scores and races and therefore will determine only which tests are relatively more or less biased for which groups. The ultimate question of whether or not there is a bias goes unanswered. The logistics involved in this approach are fairly staggering; each test under consideration should be given to each subject. The number of subjects would have to be quite large in order that other variables (IQ, socioeconomic status, parent education level) be controlled. Furthermore, as noted by Eagle and Harris (1969), who used this approach in a small-scale study:

     "Though this study strongly suggests the operation of
     significant Test x Culture interaction effects, the specific
     test characteristics (content, cognitive function, technical
     features such as speed, etc.) and specific socio-cultural
     characteristics (ethnicity, economic class, attitudes, mental
     ability level) which may be entering into these interactions,
     have not been examined."

Such control problems as these do, however, inhere in all studies where racial-socio-cultural characteristics are examined. The effects of lack of control are simply more apparent when small samples of examinees are employed.

A simplified version of the differential score approach is, of course, merely to compare test scores between groups. The control problems will disallow incisive investigation into any nature-nurture issues; the comparison concentrates instead upon the observable phenomena of score differences. Such total score differences have been observed for a long time and in a fairly consistent manner, but they do not answer questions of bias.

Differential Item Scores. In the item approach to bias (exemplified by Cleary and Hilton, 1968), subjects from various races would take one common test, and the variations in their item scores would be analyzed; some of the variations would be due to difference in items (some items are more difficult than others); some variations would be due to race (one group might have higher apparent intellectual or achievement status than another); and some variations would be due to interaction of item and race (some items might be relatively more difficult for one race than for another), and would therefore show bias.
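
The three sources of variation named above can be sketched numerically. The following is a minimal illustration, not the procedure of Cleary and Hilton or of this report: it takes a small table of item means (rows are items, columns are groups), treats the cells as equally weighted, and splits each cell into a grand mean, an item effect, a group effect, and an item-by-group interaction remainder. The function name `decompose` and the example numbers are hypothetical.

```python
def decompose(table):
    """Split an items-by-groups table of means into additive parts.

    table: list of rows (one per item), each a list of group means.
    Returns (grand mean, item effects, group effects, interaction cells).
    Cells are weighted equally; this is an illustrative simplification.
    """
    n_items = len(table)
    n_groups = len(table[0])
    grand = sum(sum(row) for row in table) / (n_items * n_groups)
    # Item effect: how much easier or harder an item is than average.
    item_eff = [sum(row) / n_groups - grand for row in table]
    # Group effect: how much higher or lower a group scores overall.
    group_eff = [
        sum(table[i][g] for i in range(n_items)) / n_items - grand
        for g in range(n_groups)
    ]
    # Interaction: what is left after item and group effects are removed.
    inter = [
        [table[i][g] - grand - item_eff[i] - group_eff[g]
         for g in range(n_groups)]
        for i in range(n_items)
    ]
    return grand, item_eff, group_eff, inter


# Hypothetical data: group 2 is uniformly 0.2 below group 1 on every item.
additive = [[0.9, 0.7], [0.7, 0.5], [0.5, 0.3]]
# Same table, but the third item is disproportionately hard for group 2.
one_odd_item = [[0.9, 0.7], [0.7, 0.5], [0.5, 0.1]]

_, _, _, inter_additive = decompose(additive)
_, _, _, inter_odd = decompose(one_odd_item)
```

In the first table every interaction cell is zero even though the groups differ; only the second table, where one item departs from the common pattern, produces a non-zero interaction, which is what the item approach would call bias.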







It should be noted here that this approach adopts a unique definition of what bias is, a definition almost solely dictated by the intent of Cleary and Hilton's study. That study was directed, like the one to be reported later, at finding the parameters of biasedness, an approach which, at best, can be relative. The relative nature is most evident in the fact that items can only show bias in relation to other items. For example, if item A exhibited highly significant differences in difficulty over two racial groups it would not be considered biased (in the relative sense) unless that difference is different from the other item differences.

* / . <r • 

Defining bias in this manner necessarily excludes some portion (possibly most) of what bias is (i.e., the overall and consistent unfair differences in scores over the racial groups). The method will uncover distinctive or unique bias effects, but will leave unnoticed the overriding bias effect, which is, incidentally, the subject of major social concern. The hope is, of course, that by isolating many of the unique bias effects over many studies, there will be convergence upon the overall bias effect.
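
Why an overall bias escapes this definition can be shown with the usual additive model for a two-way layout. The notation below is introduced here for illustration only and is not the authors'; it assumes equally weighted cells, with $I$ items and $G$ groups.

```latex
% p_{ig}: mean score of group g on item i
% (sum-to-zero constraints on all effects)
p_{ig} = \mu + \alpha_i + \beta_g + \gamma_{ig},
\qquad
\gamma_{ig} = p_{ig} - \bar{p}_{i\cdot} - \bar{p}_{\cdot g} + \bar{p}_{\cdot\cdot}

% Suppose every item is unfairly harder by the same amount c for group h:
p'_{ih} = p_{ih} - c \quad (i = 1,\dots,I), \qquad p'_{ig} = p_{ig} \quad (g \neq h)

% Then the row, column, and grand means shift as
\bar{p}'_{i\cdot} = \bar{p}_{i\cdot} - \tfrac{c}{G}, \qquad
\bar{p}'_{\cdot h} = \bar{p}_{\cdot h} - c, \qquad
\bar{p}'_{\cdot\cdot} = \bar{p}_{\cdot\cdot} - \tfrac{c}{G}

% and the interaction terms are unchanged:
\gamma'_{ih} = (p_{ih} - c) - \bigl(\bar{p}_{i\cdot} - \tfrac{c}{G}\bigr)
             - \bigl(\bar{p}_{\cdot h} - c\bigr)
             + \bigl(\bar{p}_{\cdot\cdot} - \tfrac{c}{G}\bigr) = \gamma_{ih}
```

The uniform handicap $c$ is absorbed entirely by the group effect $\beta_h$, so an item-by-race interaction analysis cannot distinguish it from a genuine group difference.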

. % - - 

Once biased items are identified (items contributing to the significant item x race interaction), it would be possible to make hypotheses about why they are biased. These hypotheses could be verified by field testing items with the same kind of characteristics, to see if they yield high item-by-race interactions in analysis of variance.

# * > ' . ' 

Isolating the biased or differentially differentiating items should lead to the study of their characteristics that might underlie the observed bias. Bornstein and Chamberlain (1969) have investigated one such item characteristic: difficulty of the language of items. Their results indicated that mere simplification of item language does not significantly reduce the observed racial differences.

The Present Study

CSE chose to define bias as an item by race interaction in an analysis of variance design. Subjects from various races would take one common test, and the variations in their item scores would be analyzed: some of the variations would be due to differences in items (some items are more difficult than others); some variation would be due to race (one group might have higher apparent intellectual or achievement status than another); and some variation would be due to interaction of item and race (some items might be relatively more difficult for one race than for another), and would therefore show bias.

In order to obtain the initial information about item by race interaction, the Stanford Achievement Test, Paragraph Meaning subtest (Form W, Primary II Battery) was administered to 172 third graders at two integrated elementary schools in a large California school district. The sample included 26 white students, 30 blacks, 64 Mexican-Americans, and 37 Orientals. Fifteen others were deleted from the analysis. This grouping of children is anthropologically, sociologically, and economically impure, but does account fairly well for the constellation of characteristics frequently cited as involved in bias. The data would be used to validate the prediction of item by race interaction and to uncover the item characteristics that appear to be involved in effecting the apparent bias.

* ^ . r O 

F-ratios were obtained to determine which items showed significant differences between ethnic groups. Of the 60 items on the subtest, 21 items showed differences between ethnic groups significant at the .001 level; thirteen other items were significant at the .01 level. Since there were four groups (with six possible comparisons between pairs of groups), Duncan's Multiple Range Test (Duncan, 1955; 1957) was applied to see which of the pairs of relationships were significant. The results, presented in Table 1, show that most of the significant differences are due to differences between the scores of Orientals vs. Mexican-Americans (37 items) and/or Orientals vs. Blacks (28 items). On only 3 items are there significant differences between black and white.
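
The per-item F-ratio can be sketched as a one-way analysis of variance on right/wrong (1/0) item scores. This is an illustrative reconstruction rather than the authors' computing procedure. The group sizes (26, 30, 64, 37) are those given in the Table 1 footnote; the counts of correct answers for item 1 are an assumption, chosen as the whole numbers that round to the reported item means (.85, .73, .73, .95). The critical values 3.91 and 5.70 are the ones printed beneath Table 1; the function names are hypothetical.

```python
def item_f_ratio(correct, sizes):
    """One-way ANOVA F-ratio for a single right/wrong item.

    correct: number of examinees answering correctly in each group.
    sizes:   number of examinees in each group.
    Returns (F, df_between, df_within).
    """
    n_total = sum(sizes)
    k = len(sizes)
    grand = sum(correct) / n_total
    # Between-groups sum of squares of the group proportions.
    ss_between = sum(n * (c / n - grand) ** 2 for c, n in zip(correct, sizes))
    # Within-groups sum of squares for 0/1 scores: n * p * (1 - p) per group.
    ss_within = sum(c * (1 - c / n) for c, n in zip(correct, sizes))
    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


def significance(f, crit_01=3.91, crit_001=5.70):
    """Label an F-ratio using the critical values printed with Table 1."""
    if f >= crit_001:
        return "**"
    if f >= crit_01:
        return "*"
    return ""


sizes = [26, 30, 64, 37]        # W, B, M, O (Table 1 footnote)
item1_correct = [22, 22, 47, 35]  # assumed counts matching .85, .73, .73, .95
f, df_b, df_w = item_f_ratio(item1_correct, sizes)
print(round(f, 2), df_b, df_w, repr(significance(f)))  # 2.7 3 153 ''
```

Under these assumed counts the sketch gives F = 2.70 on 3 and 153 degrees of freedom, which agrees with the value reported for item 1 in Table 1.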


The results for this particular sample indicate that there is considerable item by race interaction. This interaction may be due either to item bias or to race characteristics (confounded by the social and economic peculiarities of this in-vivo sample), or to both. An assumption that significant differences are due to bias inherent in the items would lead us to the implausible conclusion that the test was drawn up to be biased in favor of Orientals, for few significant differences are found involving whites, and many are found involving Orientals.

It is more likely that the results of the study are influenced unduly by the characteristics (racial, social, or economic) of the sample used. There was no control for socioeconomic status, intelligence, or any variable other than observable race. Because these variables were not controlled, we are dealing with an item by (race + IQ + SES) interaction, with no way to separate the effects of the variables.

Linear regressions (used to predict scores of one group, given the equations from the other) were computed for each of the six pairs of racial groups. Results are reported in Table 2. The raw data were item difficulties for the racial group on the 60 items in the test. Each of the regressions involving the Oriental group had a relatively large intercept; this may have been due to the ceiling effect, limiting Oriental scores at the upper end. The mean score for the Oriental group was a perfect score on 12 percent of the items. Both the slope and the intercept are affected by this ceiling effect, limiting the interpretability of the regression equation.
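
As a rough sketch of the kind of computation summarized in Table 2, the following fits an ordinary least-squares line predicting one group's item difficulties from another's and returns the slope, intercept, correlation, and standard error of the slope. It is an assumption that ordinary least squares is what the authors used; the function name `fit_line` is hypothetical, and the example uses only the first eight items as read from Table 1 (white and black item means), so its output should not be expected to match the 60-item figures in Table 2.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r, se_slope), where se_slope is the
    standard error of the regression coefficient.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # Residual sum of squares, floored at zero to absorb rounding error.
    ss_resid = max(syy - slope * sxy, 0.0)
    se_slope = math.sqrt(ss_resid / (n - 2) / sxx)
    return slope, intercept, r, se_slope


# Item means for items 1-8 as read from Table 1 (illustration only).
white = [0.85, 0.85, 0.92, 0.96, 0.88, 0.69, 0.85, 0.85]
black = [0.73, 0.60, 0.73, 0.70, 0.83, 0.60, 0.77, 0.83]
slope, intercept, r, se_slope = fit_line(white, black)
print(f"B = {slope:.3f} W + ({intercept:.3f}), r = {r:.3f}")
```

Because item means cannot exceed 1.00, a group scoring near the ceiling compresses the upper end of the scatter, which is one way to see why the slope and intercept for the Oriental pairings are hard to interpret.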



TABLE 1

Item Means for Racial Groups, F-Ratios among the Means,
and Post-Hoc Pair Comparisons between Racial Groups

Pairs compared: W-B, W-M, W-O, B-M, B-O, M-O. Each X marks one
significant pair; the pair columns are not recoverable from this copy.

              Item Means (a)         F-Ratio        Significant
Item     W     B     M     O      (df = 3, 153)     Differences

  1     .85   .73   .73   .95        2.70
  2     .85   .60   .72   .97        5.63*          X X
  3     .92   .73   .69   .89        3.27
  4     .96   .70   .77  1.00        6.18**         X X X
  5     .88   .83   .83   .89        0.35
  6     .69   .60   .69   .89        2.74
  7     .85   .77   .67   .84        1.67
  8     .85   .83   .75   .87        0.84
  9     .96   .83   .77   .95        3.10
 10     .81   .63   .67   .92        3.55           X X
 11     .73   .67   .61   .89        3.23           X
 12     .92   .67   .72   .87        2.82
 13     .81   .80   .70   .95        2.94           X
 14     .96   .83   .88  1.00        2.62
 15     .85   .70   .78  1.00        4.29*          X X
 16     .96   .73   .75  1.00        5.93**         X X
 17     .81   .43   .61   .89        7.14**         X X X
 18     .77   .60   .66  1.00        6.91**         X X
 19     .85   .60   .64   .97        6.82**         X X
 20     .85   .70   .63   .89        3.68           X
 21     .69   .60   .47   .84        5.10*          X
 22     .92   .70   .69   .97        5.82**         X X
 23     .81   .67   .53   .92        6.81**         X X
 24     .77   .70   .70  1.00        4.96*          X X
 25     .77   .47   .53   .92        8.21**         X X
 26     .65   .47   .59   .95        7.36**         X X
 27     .69   .50   .67   .89        4.33*          X
 28     .69   .47   .61   .89        5.29*          X X
 29     .73   .47   .52   .76        3.38
 30     .52   .47   .47   .65        1.16
 31     .58   .40   .55   .70        2.11
 32     .46   .40   .36   .62        2.32
 33     .62   .40   .50   .78        4.18*          X X
 34     .77   .53   .69   .73        1.46
 35     .73   .60   .59   .95        5.65*          X X

(a) W = white, N = 26; B = black, N = 30; M = Mexican-American, N = 64;
    O = Oriental, N = 37



Table 1 (continued)

Pairs compared: W-B, W-M, W-O, B-M, B-O, M-O. Each X marks one
significant pair; the pair columns are not recoverable from this copy.

              Item Means             F-Ratio        Significant
Item     W     B     M     O      (df = 3, 153)     Differences

 36     .58   .53   .39   .84        7.06**         X
 37     .62   .37   .42   .87        9.01**         X X
 38     .77   .63   .73  1.00        5.45*          X X
 39     .77   .47   .66   .95        7.38**         X X
 40     .81   .53   .61   .95        6.93**         X X
 41     .77   .53   .47   .84        6.21**         X X
 42     .69   .53   .42   .73        3.92*          X
 43     .69   .40   .55   .89        7.40**         X X
 44     .73   .47   .56   .89        6.13**         X X
 45     .46   .23   .33   .60        3.89           X X
 46     .65   .57   .39   .73        4.43*          X
 47     .54   .43   .33   .76        6.55**         X X
 48     .35   .33   .39   .68        3.95*          X X X
 49     .54   .40   .34   .51        1.72
 50     .54   .30   .41   .62        2.87
 51     .46   .23   .36   .40        1.18
 52     .58   .33   .34   .65        4.28*          X
 53     .92   .43   .55   .68        5.98**         X X
 54     .65   .33   .38   .76        7.25**         X X
 55     .58   .37   .44   .92       10.99**         X X X
 56     .38   .30   .14   .38        3.31
 57     .23   .30   .33   .32        0.30
 58     .27   .17   .19   .32        1.14
 59     .54   .47   .39   .62        1.81
 60     .23   .17   .27   .65        8.72**         X X X

 *  significant at .01    F ≥ 3.91
 ** significant at .001   F ≥ 5.70






Table 2

Item Regression Data for Six Racial Pairings

Whites (W) and Blacks (B)
     Regression: B = .789 W - .030
     Standard error of regression coefficient: .067
     Correlation: .841
     Items more than 2.58 standard error units from regression
          line: 8, 13, 17, 53

Whites (W) and Mexican-Americans (M)
     Regression: M = .797 W - .004
     Standard error of regression coefficient: .059
     Correlation: .870
     Items more than 2.58 standard error units from regression
          line: 53, 56

Whites (W) and Orientals (O)
     Regression: O = .764 W + .279
     Standard error of regression coefficient: .074
     Correlation: .805
     Items more than 2.58 standard error units from regression
          line: 51, 53, 55, 56, 60

Blacks (B) and Mexican-Americans (M)
     Regression: M = .821 B + .118
     Standard error of regression coefficient: .067
     Correlation: .851
     Items more than 2.58 standard error units from regression
          line: 46, 56

Blacks (B) and Orientals (O)
     Regression: O = .727 B + .429
     Standard error of regression coefficient: .090
     Correlation: .727
     Items more than 2.58 standard error units from regression
          line: 56, 57

Mexican-Americans (M) and Orientals (O)
     Regression: O = .846 M + .346
     Standard error of regression coefficient: .079
     Correlation: .816
     Items more than 2.58 standard error units from regression
          line: 51, 57




In linear regression, a graph of the scores for each item should show most scores lying near the regression line. The scores for some items, however, will be far from the regression line because one group's score on an item may be much greater or less than expected. When this occurs, we must assume that there are forces at work affecting the group's score on that item, which are not affecting (as greatly) scores by other groups on that item. The criterion for deciding that an item score is "far" from the regression line is that it be distant from the regression line greater than 2.58 times the standard error of the regression coefficient. This criterion eliminates 99% of the items in a normally distributed sample.
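
The flagging rule can be sketched as follows. The sketch takes the yardstick to be 2.58 times the standard error reported in Table 2 and measures each item's vertical distance from the fitted line; that reading of "standard error units" is an assumption, and the function names are hypothetical. The example uses the white/black equation from Table 2 (B = .789 W - .030, standard error .067) and five item-mean pairs read from Table 1.

```python
import math


def normal_coverage(z):
    """Share of a normal distribution lying within z standard units of its mean."""
    return math.erf(z / math.sqrt(2))


def flag_far_items(items, slope, intercept, se, z=2.58):
    """Return labels of items whose residual exceeds z * se in absolute value.

    items: list of (label, x, y) with x the predictor group's item mean
           and y the predicted group's item mean.
    """
    flagged = []
    for label, x, y in items:
        residual = y - (slope * x + intercept)
        if abs(residual) > z * se:
            flagged.append(label)
    return flagged


# (item number, white mean, black mean) as read from Table 1.
wb_items = [(4, 0.96, 0.70), (8, 0.85, 0.83), (13, 0.81, 0.80),
            (17, 0.81, 0.43), (53, 0.92, 0.43)]
print(flag_far_items(wb_items, slope=0.789, intercept=-0.030, se=0.067))
# [8, 13, 17, 53]
print(round(normal_coverage(2.58), 3))  # 0.99
```

Under this reading the four items flagged are the same four listed for the white/black pairing in Table 2, and 2.58 standard units bracket about 99 percent of a normal distribution, which is the basis for the 99% figure in the text.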


The Stanford Achievement Test - Paragraph Meaning consists of items that involve the examinee's making of logical implications from a reading selection, sometimes by referring back to specific wordings, and then choosing words (from four relatively equally attractive alternatives) to complete sentences continuing from the given selection. Since all the items rather consistently conform to this description, it seemed useless to analyze bias effects in terms of item intellectual processes. Instead, the item contents were inspected to see if the subject matter of the item content, either through knowledge, relevance, or interest, might meaningfully correspond to score differences. Utilizing the regression approach that underlies Table 2, item contents for each racial group were inspected.

White/Black Item Differences. Black students score higher than expected on items 8 and 13. These items are concerned with television and cowboys on television. The white students score higher than expected on items 17 and 53, concerned with sledding weather and supermarkets, respectively. The differences uncovered in this comparison could be related to experience and relevance to the separate groups.







White/Mexican-American Item Differences. White students do better than expected on items 53 and 56 of the test. The first item is concerned with supermarkets and the second with comparisons of physical sizes of boys. The physical size item is, however, slightly tricky: the wording leads one to an incorrect alternative (it is possible that more socially threatened students will respond in the way that appears correct and consequently be misled). The group experience and relevance hypothesis could also be operating in this comparison.

White/Oriental Item Differences. Oriental students score higher than expected on items 55 and 60. The first item is highly implicational and is concerned with the misnomer of Greenland. Item 60 is concerned with copyrights, but also involves the making of abstract implications. White students score higher than expected on items 51, 53, and 56; items concerned with farming, supermarkets, and comparing of physical sizes of boys. While the experience-relevance hypothesis is reasonable for explaining the whites' item-relative superiority, it is not very compelling as an explanation of the Orientals' performance.

Black/Mexican-American Item Differences. Black students do better than expected on items 46 and 56; items concerned with Mt. Vernon and comparing physical sizes of boys, respectively. With the exception that blacks may be more size-conscious than Mexican-Americans (doubtful, because of the pressures of "machismo" in the latter culture) the experience-relevance hypothesis does not hold up in this comparison.

Black/Oriental Item Differences. Blacks score higher than expected on items 56 and 57, both concerned with comparing physical sizes of boys. It is difficult to hypothesize a unique relationship between racial size differences and item response consistencies.






Mexican-American/Oriental Item Differences. Mexican-Americans score higher than expected on items 51 and 57; concerned with farming and comparing physical sizes of boys, respectively. If the experience-relevance hypothesis holds here, we would have to note the consistent tendency for other groups to score higher than predicted on items involving size comparisons when they are compared to the Orientals.

The evidence from this study is not overwhelmingly in favor of the hypothesis that the differential familiarity, relevance, and interest arousing aspects of items underlie the observed group differences.



Summary 



This report had as its original intent the development of procedures to codify the amount and nature of bias that inheres in standardized tests, so that Center evaluations of the tests could be modified for different racial-ethnic groups. Various methods for establishing the existence and nature of test bias are discussed, with the conclusion that test bias cannot be conclusively demonstrated in a wholly satisfactory manner. One method was nonetheless selected and applied to test items administered to two field-test schools for the purpose of investigating bias. The results of that small-scale study are discussed, but do not offer compelling reasons for the observed racial-ethnic differences.









References

Bornstein, H. & Chamberlain, K. An investigation of some of the effects of
     "Verbal Load" in achievement tests. Research Bulletin RB-69-94.
     Princeton, N. J.: Educational Testing Service, 1969.

Cleary, T. A. Test bias: Prediction of grades of Negro and white students
     in integrated colleges. Journal of Educational Measurement, 1968,
     5, 115-124.

Cleary, T. A. & Hilton, T. L. An investigation of item bias. Educational and
     Psychological Measurement, 1968, 28, 61-75.

Duncan, D. B. Multiple range and multiple F tests. Biometrics, 1955, 11, 1-42.

Duncan, D. B. Multiple range tests for correlated and heteroscedastic
     means. Biometrics, 1957, 13, 164-176.

Eagle, N. & Harris, A. S. Interaction of race and test on reading performance
     scores. Journal of Educational Measurement, 1969, 6, 131-135.

Einhorn, H. J. & Bass, A. R. Methodological considerations relevant to
     discrimination in employment testing. Psychological Bulletin, 1971,
     75, 261-269.

El Abd, H. A. The intellect of East African students. Multivariate Behavioral
     Research, 1970, 5, 423-434.

Goolsby, T. M. & Frary, R. B. Validity of the Metropolitan Readiness Test
     for white and Negro students in a southern city. Educational and
     Psychological Measurement, 1970, 30, 443-450.

Guthrie, G. M. Structure of abilities in a non-western culture. Journal of
     Educational Psychology, 1963, 54, 94-103.

Hills, J. R. & Stanley, J. C. Easier test improves prediction of black
     students' college grades. Journal of Negro Education, 1970, 39, 320-324.

Hoepfner, R., Strickland, G., Stangel, G., Jansen, P., & Patalino, M.
     CSE elementary school test evaluations. Los Angeles: Center
     for the Study of Evaluation, UCLA, 1971.

Johnson, M. Factorial invariance of African educational abilities and aptitudes.
     Research Bulletin 69-3. Princeton, N. J.: Educational Testing Service, 1969.

Linn, R. L. & Werts, C. E. Considerations for studies of test bias.
     Journal of Educational Measurement, 1971, 8, 1-4.

McGaw, B. & Jöreskog, K. G. Factorial invariance of ability measures in
     groups differing in intelligence and socioeconomic status. Research
     Bulletin RB-70-63. Princeton, N. J.: Educational Testing Service, 1970.

Millman, J. & Lindlof, J. The comparability of fifth-grade norms of the
     California, Iowa, and Metropolitan Achievement Tests. Journal of
     Educational Measurement, 1964, 1, 135-137.

Mitchell, B. C. Predictive validity of the Metropolitan Readiness Tests and
     the Murphy-Durrell Reading Readiness Analysis for white and for
     Negro pupils. Educational and Psychological Measurement, 1967,
     27, 1047-1054.

Temp, G. Test bias: Validity of the SAT for Blacks and Whites in thirteen
     integrated institutions. College Entrance Examination Board Research
     and Development Reports 70-71, No. 6. Berkeley, California:
     Educational Testing Service, 1971.

Thorndike, R. L. Concepts of culture-fairness. Journal of Educational
     Measurement, 1971, 8, 63-70.

Vandenberg, S. G. The primary mental abilities of Chinese students: A
     comparative study of the stability of factor structure. The Annals
     of the New York Academy of Sciences, 1959, 79, 257-304.

Vandenberg, S. G. The primary mental abilities of South American students:
     A second comparative study of the generality of a cognitive factor structure.
     Multivariate Behavioral Research, 1967, 2, 175-197.













